Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram
In response to disinformation and propaganda from Russian online media
following the Russian invasion of Ukraine, Russian outlets including Russia
Today and Sputnik News were banned throughout Europe. Many of these Russian
outlets, in order to reach their audiences, began to heavily promote their
content on messaging services like Telegram. In this work, to understand this
phenomenon, we study how 16 Russian media outlets have interacted with and
utilized 732 Telegram channels throughout 2022. To do this, we utilize a
multilingual version of the foundation model MPNet to embed articles and
Telegram messages in a shared embedding space and semantically compare content.
Leveraging a parallelized version of DP-Means clustering, we perform
paragraph-level topic/narrative extraction and time-series analysis with Hawkes
Processes. With this approach, we find that between 2.3% (ura.news) and 26.7% (ukraina.ru) of each website's content originated from or resulted in activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem.
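The paragraph-level topic extraction above relies on DP-Means, a k-means variant that grows the number of clusters on the fly instead of fixing k in advance. As a rough illustration of the core loop — not the paper's parallelized implementation, and run on toy 2-D points rather than MPNet embeddings — it might look like:

```python
import math

def dp_means(points, lam, iters=10):
    """DP-Means: like k-means, but any point farther than lam from
    every centroid spawns a new cluster (Kulis & Jordan, 2012)."""
    centroids = [list(points[0])]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid, or a brand-new one beyond lam.
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > lam:
                centroids.append(list(p))
                assign[i] = len(centroids) - 1
            else:
                assign[i] = j
        # Update step: recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign
```

The penalty parameter `lam` replaces k: the number of topics is discovered from the data rather than chosen up front, which is what makes the method attractive for open-ended narrative extraction.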
Fast Internet-Wide Scanning: A New Security Perspective
Techniques like passive observation and random sampling let researchers understand many aspects of day-to-day Internet operation, yet these methodologies often focus on popular services or a small demographic of users rather than providing a comprehensive view of the devices and services that constitute the Internet. As the diversity of devices and the role they play in critical infrastructure increase, so does the importance of understanding and securing these hosts. This dissertation shows how fast Internet-wide scanning provides a near-global perspective of edge hosts that enables researchers to uncover security weaknesses that only emerge at scale.
First, I show that it is possible to efficiently scan the IPv4 address space. ZMap, a network scanner specifically architected for large-scale research studies, can survey the entire IPv4 address space from a single machine in under an hour, at 97% of the theoretical maximum speed of gigabit Ethernet and with an estimated 98% coverage of publicly available hosts. Building on ZMap, I introduce Censys, a public service that maintains up-to-date and legacy snapshots of the hosts and services running across the public IPv4 address space. Censys enables researchers to efficiently ask a range of security questions.
Next, I present four case studies that highlight how Internet-wide scanning can identify new classes of weaknesses that only emerge at scale, uncover unexpected attacks, shed light on previously opaque distributed systems on the Internet, and understand the impact of consequential vulnerabilities. Finally, I explore how increased contention over IPv4 addresses introduces new challenges for performing large-scale empirical studies. I conclude with suggested directions that the research community needs to consider to retain the degree of visibility that Internet-wide scanning currently provides.
PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138660/1/zakir_1.pd
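One idea behind ZMap's speed is stateless address generation: it walks a cyclic multiplicative group modulo a prime slightly larger than 2^32, visiting every IPv4 address exactly once in pseudorandom order while storing only the current position. A toy sketch of that idea with a small prime (the generator and prime here are illustrative, not ZMap's actual constants):

```python
def cyclic_scan_order(p, g, start=1):
    """Yield 1..p-1 exactly once each, in scrambled order, by repeatedly
    multiplying by a generator g of the multiplicative group mod prime p.
    A scanner built this way needs only O(1) state (the current value)
    yet covers the whole space without repeats."""
    x = start
    while True:
        yield x
        x = (x * g) % p
        if x == start:
            return

# Toy example: p = 11 with generator g = 2 visits 1..10 once each.
order = list(cyclic_scan_order(11, 2))
```

Because the walk is deterministic given `(p, g, start)`, a scan can also be paused and resumed from a single saved value.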
Watch Your Language: Large Language Models and Content Moderation
Large language models (LLMs) have exploded in popularity due to their ability
to perform a wide array of natural language tasks. Text-based content
moderation is one LLM use case that has received recent enthusiasm; however,
there is little research investigating how LLMs perform in content moderation
settings. In this work, we evaluate a suite of modern, commercial LLMs (GPT-3,
GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community
moderation and toxic content detection. For rule-based community moderation, we
construct 95 LLM moderation-engines prompted with rules from 95 Reddit
subcommunities and find that LLMs can be effective at rule-based moderation for
many communities, achieving a median accuracy of 64% and a median precision of
83%. For toxicity detection, we find that LLMs significantly outperform
existing commercially available toxicity classifiers. However, we also find
that recent increases in model size add only marginal benefit to toxicity
detection, suggesting a potential performance plateau for LLMs on toxicity
detection tasks. We conclude by outlining avenues for future work in studying
LLMs and content moderation.
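Each of the 95 moderation engines described above is, in essence, an LLM prompted with one community's rules. A hypothetical sketch of such a prompt builder — the paper's actual prompt wording is not reproduced here, and the example rules are invented:

```python
def build_moderation_prompt(community, rules, comment):
    """Assemble a rule-based moderation prompt for an LLM.
    Illustrative only: the wording and rule set are hypothetical."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
    return (
        f"You are a moderator for the online community r/{community}, "
        f"which has the following rules:\n{numbered}\n\n"
        "Does the comment below violate any of these rules? "
        "Answer YES or NO, and if YES, give the rule number.\n\n"
        f"Comment: {comment}"
    )
```

The returned string would then be sent to a model such as GPT-3.5 or GPT-4, with one prompt template instantiated per subreddit's rule set.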
Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter
Social media platforms are often blamed for exacerbating political
polarization and worsening public dialogue. Many claim hyperpartisan users post
pernicious content, slanted to their political views, inciting contentious and
toxic conversations. However, what factors actually contribute to increased
online toxicity and negative interactions? In this work, we explore the role
that political ideology plays in contributing to toxicity both on an individual
user level and a topic level on Twitter. To do this, we train and open-source a
DeBERTa-based toxicity detector with a contrastive objective that outperforms
the Google Jigsaw Perspective toxicity detector on the Civil Comments test
dataset. Then, after collecting 187 million tweets from 55,415 Twitter users,
we determine how several account-level characteristics, including political
ideology and account age, predict how often each user posts toxic content.
Running a linear regression, we find that the diversity of views and the toxicity of the other accounts with which a user engages have a more marked effect on that user's own toxicity. Namely, toxic comments are correlated with users
who engage with a wider array of political views. Performing topic analysis on
the toxic content posted by these accounts using the large language model MPNet
and a version of the DP-Means clustering algorithm, we find similar behavior
across 6,592 individual topics, with conversations on each topic becoming more
toxic as a wider diversity of users becomes involved.
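The regression finding above — engagement diversity predicting toxicity — can be illustrated with the closed-form least-squares slope. The data below are invented toy values; the paper's actual regression includes many account-level covariates:

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope for a univariate regression: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# Toy data: per-user diversity of engaged views vs. share of toxic posts.
diversity = [0.1, 0.3, 0.5, 0.7, 0.9]
toxicity = [0.02, 0.04, 0.05, 0.08, 0.10]
slope = ols_slope(diversity, toxicity)  # positive slope: more diversity, more toxicity
```

A positive slope on real data would mirror the paper's claim that users engaging with a wider array of views post toxic content more often.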
Data-driven curation, learning and analysis for inferring evolving IoT botnets in the wild
© 2019 Association for Computing Machinery. The insecurity of the Internet-of-Things (IoT) paradigm continues to wreak havoc in consumer and critical infrastructure realms. Several challenges impede addressing IoT security at large, including the lack of IoT-centric data that can be collected, analyzed and correlated, due to the highly heterogeneous nature of such devices and their widespread deployments in Internet-wide environments. To this end, this paper explores macroscopic, passive empirical data to shed light on this evolving threat phenomenon. This not only aims at classifying and inferring Internet-scale compromised IoT devices by solely observing such one-way network traffic, but also endeavors to uncover, track and report on orchestrated "in the wild" IoT botnets. Initially, to prepare the effective utilization of such data, a novel probabilistic model is designed and developed to cleanse such traffic from noise samples (i.e., misconfiguration traffic). Subsequently, several shallow and deep learning models are evaluated to ultimately design and develop a multi-window convolutional neural network trained on active and passive measurements to accurately identify compromised IoT devices. Consequently, to infer orchestrated and unsolicited activities that have been generated by well-coordinated IoT botnets, hierarchical agglomerative clustering is deployed by scrutinizing a set of innovative and efficient network feature sets. By analyzing 3.6 TB of recent darknet traffic, the proposed approach uncovers a momentous 440,000 compromised IoT devices and generates evidence-based artifacts related to 350 IoT botnets. While some of these detected botnets refer to previously documented campaigns such as Hide and Seek, Hajime and Fbot, other events illustrate evolving threats, such as those with cryptojacking capabilities and those targeting industrial control system communication and control services.
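The botnet-inference step above hinges on hierarchical agglomerative clustering. A minimal single-linkage sketch on toy 2-D feature vectors — the paper's features are richer darknet-traffic statistics, and production code would use an optimized library rather than this O(n³) loop:

```python
import math

def single_linkage(points, threshold):
    """Naive single-linkage agglomerative clustering: repeatedly merge
    the two closest clusters until the closest remaining pair is farther
    apart than `threshold`. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

In the paper's setting, sources whose scanning-behavior features merge below the cut threshold would be treated as candidates for the same coordinated campaign.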
Stratosphere: Finding Vulnerable Cloud Storage Buckets
Misconfigured cloud storage buckets have leaked hundreds of millions of
medical, voter, and customer records. These breaches are due to a combination
of easily-guessable bucket names and error-prone security configurations,
which, together, allow attackers to easily guess and access sensitive data. In
this work, we investigate the security of buckets, finding that prior studies
have largely underestimated cloud insecurity by focusing on simple,
easy-to-guess names. By leveraging prior work in the password analysis space,
we introduce Stratosphere, a system that learns how buckets are named in
practice in order to efficiently guess the names of vulnerable buckets. Using
Stratosphere, we find widespread exploitation of buckets and vulnerable
configurations continuing to increase over the years. We conclude with
recommendations for operators, researchers, and cloud providers.
Comment: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses, 202
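Stratosphere's core move is to learn how buckets are named from observed names and use those patterns to generate likely candidates. A deliberately simplified sketch — the pattern list and the `{org}` placeholder scheme are invented for illustration; the real system learns patterns statistically, in the spirit of password-guessing models:

```python
from itertools import product

def candidate_buckets(org_names, patterns):
    """Expand naming patterns (with an '{org}' placeholder) against a
    list of organization names to produce candidate bucket names.
    Hypothetical patterns; Stratosphere learns these from real data."""
    return [pat.format(org=org) for org, pat in product(org_names, patterns)]

candidates = candidate_buckets(
    ["acme"], ["{org}-backup", "{org}-prod", "{org}-logs"])
```

Each candidate would then be probed with an unauthenticated request to test whether the bucket exists and is readable.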
A Golden Age: Conspiracy Theories' Relationship with Misinformation Outlets, News Media, and the Wider Internet
Do we live in a "Golden Age of Conspiracy Theories?" In the last few decades,
conspiracy theories have proliferated on the Internet with some having
dangerous real-world consequences. A large contingent of those who participated
in the January 6th attack on the US Capitol believed fervently in the QAnon
conspiracy theory. In this work, we study the relationships amongst five
prominent conspiracy theories (QAnon, COVID, UFO/Aliens, 9-11, and Flat-Earth)
and each of their respective relationships to the news media, both mainstream
and fringe. Identifying and publishing a set of 755 different conspiracy theory
websites dedicated to our five conspiracy theories, we find that each set often
hyperlinks to the same external domains, with COVID and QAnon conspiracy theory websites having the largest number of shared connections. Examining the role of news
media, we further find not only that outlets known for spreading misinformation hyperlink to our set of conspiracy theory websites more often than mainstream websites do, but also that this hyperlinking increased dramatically between 2018 and 2021, with the advent of QAnon and the start of the COVID-19 pandemic. Using partial Granger causality, we uncover several positive correlative relationships between the hyperlinks from misinformation websites and the popularity of conspiracy theory websites, suggesting the prominent role that misinformation news outlets play in popularizing many conspiracy theories.
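The Granger-style analysis above asks whether misinformation-site hyperlinks lead conspiracy-site popularity in time. Partial Granger causality additionally conditions on confounding series; as only the simplest intuition behind it, here is a toy lead-lag correlation check on invented series:

```python
from statistics import mean, stdev

def lagged_correlation(x, y, lag=1):
    """Toy lead-lag check: Pearson correlation between x shifted forward
    by `lag` steps and y. A large positive value suggests x leads y.
    (Partial Granger causality, used in the paper, also conditions on
    other series; this sketch does not.)"""
    xs, ys = x[:-lag], y[lag:]
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))
```

In the paper's setting, x would be weekly hyperlink counts from misinformation outlets and y the popularity of a conspiracy theory's websites.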
Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit
In the buildup to and in the weeks following the Russian Federation's
invasion of Ukraine, Russian state media outlets output torrents of misleading
and outright false information. In this work, we study this coordinated
information campaign in order to understand the most prominent state media
narratives touted by the Russian government to English-speaking audiences. To
do this, we first perform sentence-level topic analysis using the
large-language model MPNet on articles published by ten different pro-Russian
propaganda websites including the new Russian "fact-checking" website
waronfakes.com. Within this ecosystem, we show that smaller websites like
katehon.com were highly effective at publishing topics that were later echoed
by other Russian sites. After analyzing this set of Russian information
narratives, we then analyze their correspondence with narratives and topics of
discussion on the r/Russia and 10 other political subreddits. Using MPNet and a
semantic search algorithm, we map these subreddits' comments to the set of
topics extracted from our set of Russian websites, finding that 39.6% of
r/Russia comments corresponded to narratives from pro-Russian propaganda
websites, compared to 8.86% on r/politics.
Comment: Accepted to ICWSM 202
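The mapping of Reddit comments to extracted narratives boils down to nearest-neighbor search under cosine similarity, with a threshold below which a comment matches no narrative. A sketch with placeholder 2-D vectors standing in for MPNet embeddings (the topic names and the 0.7 default threshold are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_topic(comment_vec, topic_vecs, threshold=0.7):
    """Return the topic whose embedding is most cosine-similar to the
    comment embedding, or None if no topic clears the threshold."""
    best = max(topic_vecs, key=lambda name: cosine(comment_vec, topic_vecs[name]))
    return best if cosine(comment_vec, topic_vecs[best]) >= threshold else None
```

Comments that map to some extracted narrative count toward figures like the 39.6% reported for r/Russia; comments below the threshold count as unmatched.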
Cloud Watching: Understanding Attacks Against Cloud-Hosted Services
Cloud computing has dramatically changed service deployment patterns. In this
work, we analyze how attackers identify and target cloud services in contrast
to traditional enterprise networks and network telescopes. Using a diverse set
of cloud honeypots across 5 providers and 23 countries, as well as 2 educational networks and 1 network telescope, we analyze how IP address assignment, geography, network, and service-port selection influence which services are targeted in the cloud. We find that scanners that target cloud compute are
selective: they avoid scanning networks without legitimate services and they
discriminate between geographic regions. Further, attackers mine
Internet-service search engines to find exploitable services and, in some
cases, they avoid targeting IANA-assigned protocols, causing researchers to
misclassify at least 15% of traffic on select ports. Based on our results, we derive recommendations for researchers and operators.
Comment: Proceedings of the 2023 ACM Internet Measurement Conference (IMC '23), October 24-26, 2023, Montreal, QC, Canada
LZR: Identifying Unexpected Internet Services
Internet-wide scanning is a commonly used research technique that has helped uncover real-world attacks, find cryptographic weaknesses, and understand both operator and miscreant behavior. Studies that employ scanning have largely assumed that services are hosted on their IANA-assigned ports, overlooking the study of services on unusual ports. In this work, we investigate where Internet services are deployed in practice and evaluate the security posture of services on unexpected ports. We show protocol deployment is more diffuse than previously believed and that protocols run on many additional ports beyond their primary IANA-assigned port. For example, only 3% of HTTP and 6% of TLS services run on ports 80 and 443, respectively. Services on non-standard ports are more likely to be insecure, which results in studies dramatically underestimating the security posture of Internet hosts. Building on our observations, we introduce LZR ("Laser"), a system that identifies 99% of identifiable unexpected services in five handshakes and dramatically reduces the time needed to perform application-layer scans on ports with few responsive expected services (e.g., 5500% speedup on 27017/MongoDB). We conclude with recommendations for future studies.
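LZR identifies what actually runs on a port from the first bytes a host sends back, rather than trusting IANA assignments. A toy fingerprinter in that spirit — these few signatures are illustrative only; LZR covers dozens of protocols and sends staged follow-up probes when a service stays silent:

```python
def identify_service(banner: bytes) -> str:
    """Guess a protocol from the first bytes a service returns.
    Tiny illustrative signature set, not LZR's actual fingerprint logic."""
    if banner.startswith(b"SSH-"):
        return "ssh"            # SSH identification string
    if banner.startswith(b"HTTP/"):
        return "http"           # HTTP status line
    if banner.startswith(b"220 "):
        return "ftp-or-smtp"    # common greeting code for FTP/SMTP
    if banner[:1] == b"\x16":
        return "tls"            # TLS handshake record type
    return "unknown"
```

Matching on response content rather than port number is what lets a scanner notice, say, an SSH daemon answering on port 443.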